Numba JIT Compilation
The Trap That Silently Murders Performance
This code looks like it uses Numba:
from numba import jit
import numpy as np
import time

@jit
def compute_distance_matrix(points):
    """Compute pairwise Euclidean distance matrix."""
    n = len(points)
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            result[i][j] = (dx**2 + dy**2)**0.5
    return result

points = [(float(i), float(i*2)) for i in range(1000)]  # list of tuples

start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix(points)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")
Time: 5.83s
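That is slower than pure Python running the same algorithm (4.21s), and that is the trap. Numba fell back to object mode because points is a list of tuples - not a type Numba can compile natively. The function was still compiled, but into a form that routes every operation through the Python interpreter, which adds overhead instead of removing it. (This silent fallback is the behaviour of a bare @jit in Numba releases before 0.59; from 0.59 onward a bare @jit defaults to nopython mode and no longer falls back silently.)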
Now the correct version:
from numba import njit
import numpy as np
import time

@njit  # equivalent to @jit(nopython=True) - FAILS LOUDLY if types are wrong
def compute_distance_matrix_fast(points: np.ndarray) -> np.ndarray:
    """Compute pairwise Euclidean distance matrix - NumPy array input."""
    n = points.shape[0]
    result = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            result[i, j] = (dx*dx + dy*dy)**0.5
    return result

points_np = np.array([(float(i), float(i*2)) for i in range(1000)])

# Warm up (first call compiles)
compute_distance_matrix_fast(points_np)

start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix_fast(points_np)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")
Time: 0.031s (188x faster than the broken @jit version)
And with parallel=True:
from numba import njit, prange

@njit(parallel=True)
def compute_distance_matrix_parallel(points: np.ndarray) -> np.ndarray:
    n = points.shape[0]
    result = np.zeros((n, n), dtype=np.float64)
    for i in prange(n):  # parallel outer loop
        for j in range(n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            result[i, j] = (dx*dx + dy*dy)**0.5
    return result

compute_distance_matrix_parallel(points_np)  # warm up

start = time.perf_counter()
for _ in range(10):
    dist = compute_distance_matrix_parallel(points_np)
elapsed = time.perf_counter() - start
print(f"Time: {elapsed:.3f}s")
Time: 0.009s (648x faster than the broken @jit version, 8-core machine)
| Version | Time | Notes |
|---|---|---|
| @jit with list of tuples | 5.83 s | Object mode - DO NOT USE |
| Pure Python (no Numba) | 4.21 s | Baseline |
| @njit with NumPy array | 0.031 s | 136x over Python |
| @njit(parallel=True), 8 cores | 0.009 s | 468x over Python |
The @jit decorator without nopython=True is a footgun. Always use @njit or @jit(nopython=True).
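In practice the fix often happens at the call boundary: convert the input once, before it reaches the jitted function. A minimal sketch, assuming the points arrive as a Python list of (x, y) tuples; the helper name as_points_array is hypothetical and not part of Numba or NumPy:

```python
import numpy as np

def as_points_array(points):
    """Convert a sequence of (x, y) pairs to a C-contiguous float64 array of shape (n, 2)."""
    arr = np.ascontiguousarray(points, dtype=np.float64)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError(f"expected shape (n, 2), got {arr.shape}")
    return arr

points = [(float(i), float(i * 2)) for i in range(1000)]  # list of tuples
points_np = as_points_array(points)
print(points_np.shape, points_np.dtype)  # (1000, 2) float64
```

Validating the shape here keeps a malformed input from surfacing later as a confusing typing or indexing error inside compiled code.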
What You Will Learn
- Understand how Numba compiles Python to LLVM IR and native machine code
- Use @njit and cache=True correctly for persistent compilation caching
- Parallelise loops automatically with @njit(parallel=True) and prange
- Understand Numba's type system and avoid the object mode trap
- Create NumPy ufuncs from Python functions with @vectorize
- Use @guvectorize for sliding-window and reduction operations
- Write GPU kernels with @cuda.jit (optional - requires CUDA GPU)
- Choose between Numba, Cython, and NumPy for different workloads
Prerequisites
| Requirement | Level Needed |
|---|---|
| NumPy arrays and indexing | Comfortable |
| Python decorators | Comfortable |
| Basic threading/parallelism ideas | Helpful |
| CUDA concepts (Section 7 only) | Helpful |
Section 1: How Numba Works
Numba is a just-in-time (JIT) compiler. Unlike Cython (which compiles to C at build time), Numba compiles Python functions to native machine code the first time they are called with specific argument types, using the LLVM compiler infrastructure.
First call: compute_distance_matrix_fast(points_np)
        │
        ▼
Numba inspects argument types:
    points → numpy.ndarray, dtype=float64, ndim=2
        │
        ▼
Numba generates LLVM IR (intermediate representation):
    define double* @compute_distance_matrix_fast(...) {
        %n = ...
      for.loop:
        %dx = fsub double %xi, %xj
        ...
    }
        │
        ▼
LLVM optimises and compiles to x86-64 machine code
        │
        ▼
Compiled function is cached (in memory and optionally on disk)
        │
        ▼
Subsequent calls: directly execute native code - zero Python overhead
nopython Mode vs object Mode
| Mode | Behaviour | Performance |
|---|---|---|
| nopython | All types resolved to C types, no Python object access | Native C speed |
| object | Falls back to Python interpreter for unknown types | Slower than pure Python |
In object mode, Numba still compiles the function but inserts calls back to the Python interpreter wherever it encounters types it does not understand. The compilation overhead is paid (first call is slow) but the runtime overhead is worse than pure Python (redundant dispatch through Numba's object layer).
Always use @njit (nopython=True). If Numba cannot compile in nopython mode, you get a clear error message at first call - not silent slowness.
Section 2: @njit and Compilation Caching
Basic @njit
from numba import njit
import numpy as np
import time

@njit
def exponential_moving_average(data: np.ndarray, alpha: float) -> np.ndarray:
    """
    Exponential moving average - cannot be vectorised with NumPy alone
    because each output depends on the previous one.
    """
    n = data.shape[0]
    result = np.empty(n, dtype=np.float64)
    result[0] = data[0]
    for i in range(1, n):
        result[i] = alpha * data[i] + (1.0 - alpha) * result[i - 1]
    return result

# Measure compilation time vs execution time
data = np.random.randn(1_000_000)
t0 = time.perf_counter()
out = exponential_moving_average(data, 0.1)  # includes compilation
t1 = time.perf_counter()
out = exponential_moving_average(data, 0.1)  # compiled - fast
t2 = time.perf_counter()
print(f"First call (compile + run): {(t1-t0)*1000:.1f} ms")
print(f"Second call (run only): {(t2-t1)*1000:.3f} ms")
First call (compile + run): 847.3 ms
Second call (run only): 0.8 ms
The compilation cost (847ms) is paid once per Python interpreter session per unique type signature. This is the JIT warmup cost.
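Because the first call behaves so differently from later ones, it helps to report the two separately whenever you benchmark a jitted function. A minimal sketch using only the standard library; the helper name time_first_and_steady is hypothetical, and a plain built-in stands in for a jitted function so the snippet runs anywhere:

```python
import time

def time_first_and_steady(fn, *args, repeats=10):
    """Return (first_call_seconds, best_later_call_seconds) for fn(*args)."""
    t0 = time.perf_counter()
    fn(*args)  # for a JIT function this call includes compilation
    first = time.perf_counter() - t0

    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return first, best

# Works with any callable; sum() stands in for a jitted function here.
first, steady = time_first_and_steady(sum, range(100_000))
print(f"first: {first * 1e3:.3f} ms, steady: {steady * 1e3:.3f} ms")
```

Taking the best of several later calls, rather than the mean, reduces the influence of scheduler noise on short timings.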
cache=True - Persistent Disk Cache
@njit(cache=True)  # save compiled code to __pycache__/
def exponential_moving_average(data: np.ndarray, alpha: float) -> np.ndarray:
    n = data.shape[0]
    result = np.empty(n, dtype=np.float64)
    result[0] = data[0]
    for i in range(1, n):
        result[i] = alpha * data[i] + (1.0 - alpha) * result[i - 1]
    return result
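With cache=True, Numba saves the compiled machine code to a __pycache__/ directory next to the source file (an .nbi index file plus .nbc data files). On the next interpreter startup, the cache is loaded instead of recompiling. The first call is still slightly slower (cache loading) but not 847ms.
The cache is invalidated automatically when the source file that defines the function changes. Changes to functions imported from other files are not detected, so clear the cache by hand in that case.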
Inspecting Compiled Types
# After the function has been called at least once:
exponential_moving_average.inspect_types()
Output:
exponential_moving_average (array(float64, 1d, C), float64)
--------------------------------------------------------------------------------
# File: ema.py
# --- LINE 5 ---
# label 0
# data = arg(0, name=data) :: array(float64, 1d, C)
# alpha = arg(1, name=alpha) :: float64
# n = data.shape[0] :: int64
# result = np.empty(n, ...) :: array(float64, 1d, C)
# result[0] = data[0] :: (store)
This shows exactly what types Numba inferred for each variable. If any variable shows pyobject, it is in object mode and will be slow.
Multiple Type Specialisations
@njit(cache=True)
def add_scalar(arr, val):
    result = np.empty_like(arr)
    for i in range(arr.shape[0]):
        result[i] = arr[i] + val
    return result

a_f32 = np.ones(100, dtype=np.float32)
a_f64 = np.ones(100, dtype=np.float64)
a_i64 = np.ones(100, dtype=np.int64)
add_scalar(a_f32, np.float32(1.0))  # compiles for (float32, float32)
add_scalar(a_f64, 1.0)              # compiles for (float64, float64)
add_scalar(a_i64, 1)                # compiles for (int64, int64)

# Each specialisation is separate compiled code
print(add_scalar.signatures)
# [(array(float32, 1d, C), float32),
#  (array(float64, 1d, C), float64),
#  (array(int64, 1d, C), int64)]
Each unique combination of argument types triggers a separate compilation. Numba caches all specialisations.
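One way to picture this is a dictionary keyed by argument types. The sketch below is a conceptual model only, not Numba's implementation: ToyDispatcher and compile_count are hypothetical names, and the "compilation" step simply stores the original Python function.

```python
class ToyDispatcher:
    """Conceptual model only: one cached 'compilation' per argument-type signature."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.specialisations = {}  # signature tuple -> "compiled" callable
        self.compile_count = 0

    def __call__(self, *args):
        signature = tuple(type(a).__name__ for a in args)
        if signature not in self.specialisations:
            # A real JIT would run type inference and code generation here.
            self.compile_count += 1
            self.specialisations[signature] = self.py_func
        return self.specialisations[signature](*args)


@ToyDispatcher
def add(a, b):
    return a + b

add(1, 2)      # new signature ('int', 'int'): "compiles"
add(3, 4)      # same signature: cache hit
add(1.0, 2.0)  # new signature ('float', 'float'): "compiles" again
print(add.compile_count)          # 2
print(list(add.specialisations))  # [('int', 'int'), ('float', 'float')]
```

The model also shows why a function called with many different dtype combinations warms up slowly: every new key costs one compilation.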
Section 3: @njit(parallel=True) and prange
Numba's parallel mode automatically parallelises loops across CPU cores using one of its threading layers: Intel TBB (Threading Building Blocks), OpenMP, or the built-in workqueue. Unlike Cython's prange which requires manual setup, Numba handles thread management transparently.
from numba import njit, prange
import numpy as np

@njit(parallel=True, cache=True)
def parallel_rolling_sum(data: np.ndarray, window: int) -> np.ndarray:
    """
    Parallel rolling sum - each output position is independent.
    prange distributes iterations across CPU cores automatically.
    """
    n = data.shape[0]
    out_len = n - window + 1
    result = np.empty(out_len, dtype=np.float64)
    for i in prange(out_len):  # parallel: iterations are split into chunks across threads
        total = 0.0
        for j in range(window):
            total += data[i + j]
        result[i] = total
    return result
Parallel Reduction
@njit(parallel=True)
def parallel_sum(arr: np.ndarray) -> float:
    """
    Parallel sum using Numba's automatic reduction detection.
    Numba detects that 'total += arr[i]' is a reduction and
    creates thread-private totals that are combined at the end.
    """
    total = 0.0
    for i in prange(arr.shape[0]):
        total += arr[i]
    return total
Parallel Image Filter
@njit(parallel=True, cache=True)
def gaussian_blur_2d(
    image: np.ndarray,   # shape: (H, W)
    kernel: np.ndarray,  # shape: (k, k)
) -> np.ndarray:
    """
    Apply a 2D convolution kernel to an image.
    Outer loop (rows) is parallelised across threads.
    """
    H, W = image.shape
    k = kernel.shape[0]
    pad = k // 2
    result = np.zeros_like(image)
    for row in prange(pad, H - pad):  # parallel over rows
        for col in range(pad, W - pad):
            val = 0.0
            for ki in range(k):
                for kj in range(k):
                    val += image[row - pad + ki, col - pad + kj] * kernel[ki, kj]
            result[row, col] = val
    return result

# Usage
import numpy as np
from numba import njit, prange
import time

image = np.random.rand(1080, 1920).astype(np.float64)
kernel = np.ones((5, 5), dtype=np.float64) / 25.0  # box blur

# Warm up
gaussian_blur_2d(image, kernel)

start = time.perf_counter()
for _ in range(10):
    blurred = gaussian_blur_2d(image, kernel)
elapsed = time.perf_counter() - start
print(f"10 runs: {elapsed:.3f}s ({elapsed/10*1000:.1f}ms each)")
10 runs: 0.234s (23.4ms each) - on 8 cores
Equivalent Python nested loops: ~45 seconds per iteration.
prange Safety Rules
prange is safe only when:
- Iterations are independent - result[i] does not depend on result[i-1]
- No shared write locations (write races)
- Reductions are to a single scalar variable (Numba handles these automatically)
Unsafe example:
@njit(parallel=True)
def UNSAFE_cumsum(arr):
    result = np.zeros_like(arr)
    for i in prange(arr.shape[0]):
        result[i] = result[i-1] + arr[i]  # RACE: reads result[i-1] while another thread writes it
    return result
This produces silently wrong answers without raising an error.
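Since the failure is silent, a cheap guard is to compare every parallel function against a serial reference on a small input. A minimal sketch, with hypothetical function names; the try/except fallback is an assumption added only so the snippet still runs where Numba is not installed (it then executes as plain Python):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs without Numba
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn

@njit
def rolling_sum_serial(data, window):
    out = np.empty(data.shape[0] - window + 1, dtype=np.float64)
    for i in range(out.shape[0]):
        total = 0.0
        for j in range(window):
            total += data[i + j]
        out[i] = total
    return out

@njit(parallel=True)
def rolling_sum_parallel(data, window):
    out = np.empty(data.shape[0] - window + 1, dtype=np.float64)
    for i in prange(out.shape[0]):  # safe: iteration i writes only out[i]
        total = 0.0
        for j in range(window):
            total += data[i + j]
        out[i] = total
    return out

data = np.arange(1_000, dtype=np.float64)
expected = rolling_sum_serial(data, 5)
actual = rolling_sum_parallel(data, 5)
assert np.allclose(actual, expected), "parallel result diverges from serial reference"
```

A race does not necessarily show up on every run, so a passing check is evidence rather than proof; running it several times and on different input sizes makes it more convincing.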
Section 4: Numba's Type System
Understanding what Numba can and cannot compile is critical to avoiding object mode.
Supported Types
| Python/NumPy Type | Numba Support | Notes |
|---|---|---|
| int, float, bool | Full | Mapped to C int64, double, bool |
| complex | Full | complex128 |
| numpy.ndarray | Full | Any dtype, any ndim |
| numpy.float64 scalars | Full | Used as C doubles |
| tuple (homogeneous) | Partial | Fixed-length only |
| list (homogeneous) | Partial | Reflected lists - use NumPy arrays |
| dict | Partial (typed) | numba.typed.Dict only |
| str | Partial | Unicode strings compile in nopython mode (Numba 0.41+), but only a subset of methods is supported |
| bytes | Partial | Indexing, iteration and len() only |
| Arbitrary Python objects | No | Triggers object mode (a TypingError under @njit) |
| pandas.DataFrame | No | Extract NumPy arrays first |
| datetime.datetime | Partial | Via numpy.datetime64 |
What Does NOT Work in nopython Mode
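from numba import njit
import statistics

@njit
def FAILS_1(data):
    return len(str(data[0]))  # str() of a float is not supported in nopython

@njit
def FAILS_2(data):
    return statistics.median(data)  # calls into an uncompiled pure-Python module

@njit
def FAILS_3(data):
    import json  # imports not supported in nopython
    return json.dumps(data.tolist())

@njit
def FAILS_4(data):
    label = 0.0
    if data[0] > 0.0:
        label = "positive"  # one variable cannot be both float64 and str
    return label

All of these raise TypingError or another NumbaError subclass at first call - which is the correct behaviour. You get the error immediately, not silent slowness. The supported feature set grows with each release (sorted() and many str operations compile in current versions), so check the supported-features reference for the Numba version you run.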
Numba Typed Containers
For use cases that require mutable containers inside JIT functions:
from numba import njit
from numba.typed import List, Dict
import numba
import numpy as np

@njit
def build_filtered_list(data: np.ndarray, threshold: float):
    """Build a filtered list inside a JIT function using Numba's typed List."""
    result = List.empty_list(numba.float64)  # explicit item type, no dummy append needed
    for i in range(data.shape[0]):
        if data[i] > threshold:
            result.append(data[i])
    return result

@njit
def count_by_bucket(data: np.ndarray, n_buckets: int):
    """Histogram using Numba's typed Dict."""
    counts = Dict.empty(
        key_type=numba.int64,
        value_type=numba.int64,
    )
    for i in range(data.shape[0]):
        bucket = int(data[i]) % n_buckets
        if bucket in counts:
            counts[bucket] += 1
        else:
            counts[bucket] = 1
    return counts
Note: In most cases, using a pre-allocated NumPy array instead of a Numba typed container is faster and simpler.
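For comparison, the same filter can be written with a pre-allocated NumPy buffer and a running count. This is a minimal sketch under stated assumptions: filter_above is a hypothetical name, and the try/except fallback exists only so the snippet runs as plain Python where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn

@njit
def filter_above(data, threshold):
    out = np.empty(data.shape[0], dtype=np.float64)  # worst case: every element passes
    count = 0
    for i in range(data.shape[0]):
        if data[i] > threshold:
            out[count] = data[i]
            count += 1
    return out[:count].copy()  # copy so the oversized buffer can be released

values = np.array([0.5, 2.0, -1.0, 3.5, 1.0])
print(filter_above(values, 1.0).tolist())  # [2.0, 3.5]
```

The cost is one allocation sized for the worst case; in exchange the loop body is a plain array store, and the result is an ordinary ndarray that the rest of a NumPy pipeline can consume directly.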
Section 5: @vectorize - Creating NumPy Ufuncs
@vectorize creates a NumPy universal function (ufunc) from a Python function that operates on scalar values. The resulting ufunc broadcasts automatically over arrays of any shape, just like np.sin, np.exp, etc.
from numba import vectorize
import numpy as np

# @vectorize takes a list of type signatures: output(input1, input2, ...)
# List the most specific signature first (float32 before float64),
# otherwise type-based dispatch always picks the wider one.
@vectorize(['float32(float32)', 'float64(float64)'])
def leaky_relu(x):
    """Leaky ReLU activation - operates on a single scalar."""
    return x if x > 0.0 else 0.01 * x

@vectorize(['float64(float64, float64)'])
def huber_loss(prediction, target):
    """
    Huber loss - robust to outliers.
    δ = 1.0
    L = 0.5*(pred-tgt)² if |pred-tgt| ≤ 1 else |pred-tgt| - 0.5
    """
    diff = abs(prediction - target)
    if diff <= 1.0:
        return 0.5 * diff * diff
    return diff - 0.5

# These functions now work on arrays of any shape
predictions = np.random.randn(1_000_000)
targets = np.random.randn(1_000_000)
losses = huber_loss(predictions, targets)  # vectorised over 1M elements
print(losses.shape)  # (1000000,)
print(losses.mean())
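Broadcasting is the part that a hand-written loop does not give you for free. The sketch below applies the same scalar Huber function to a (2, 2) array and a length-2 row; the fallback to np.vectorize is an assumption added only so the snippet runs without Numba (it is far slower but broadcasts the same way).

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:  # fallback so the sketch still runs without Numba
    def vectorize(signatures, **kwargs):
        return lambda fn: np.vectorize(fn, otypes=[np.float64])

@vectorize(['float64(float64, float64)'])
def huber_loss(prediction, target):
    diff = abs(prediction - target)
    if diff <= 1.0:
        return 0.5 * diff * diff
    return diff - 0.5

preds = np.array([[0.0, 0.5], [2.0, 3.0]])  # shape (2, 2)
target_row = np.array([0.0, 1.0])           # shape (2,), broadcast across both rows
print(huber_loss(preds, target_row).tolist())  # [[0.0, 0.125], [1.5, 1.5]]
```

The scalar function never sees an array; the ufunc machinery pairs up elements according to NumPy's broadcasting rules and calls the compiled body once per pair.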
Parallel Ufunc
@vectorize(['float64(float64, float64)'], target='parallel')
def huber_loss_parallel(prediction, target):
    """Same function, parallel execution across CPU cores."""
    diff = abs(prediction - target)
    if diff <= 1.0:
        return 0.5 * diff * diff
    return diff - 0.5

# target options: 'cpu' (default), 'parallel' (multi-core), 'cuda' (GPU)
Custom Activation Functions - A Practical Example
from numba import vectorize
import numpy as np
import math
import time

@vectorize(['float64(float64, float64, float64)'], target='parallel', cache=True)
def parametric_relu(x, alpha, threshold):
    """PReLU with configurable negative slope and threshold."""
    if x >= threshold:
        return x
    return alpha * (x - threshold)

@vectorize(['float64(float64)'], target='parallel', cache=True)
def selu(x):
    """
    Scaled Exponential Linear Unit - requires exact constants
    for self-normalising property.
    """
    alpha = 1.6732632423543772
    scale = 1.0507009873554805
    if x > 0.0:
        return scale * x
    return scale * alpha * (math.exp(x) - 1.0)

# Benchmark vs NumPy equivalent
data = np.random.randn(5_000_000)

# NumPy implementation
def selu_numpy(x):
    alpha = 1.6732632423543772
    scale = 1.0507009873554805
    return np.where(x > 0, scale * x, scale * alpha * (np.exp(x) - 1.0))

start = time.perf_counter()
for _ in range(20):
    out = selu_numpy(data)
print(f"NumPy SELU: {(time.perf_counter()-start)/20*1000:.2f}ms")

selu(data)  # warm up
start = time.perf_counter()
for _ in range(20):
    out = selu(data)
print(f"Numba SELU: {(time.perf_counter()-start)/20*1000:.2f}ms")
NumPy SELU: 48.23ms
Numba SELU: 12.41ms (parallel, 4-core)
The NumPy version allocates a temporary array for every intermediate step (the x > 0 mask, scale * x, np.exp(x) and the arithmetic around it) before np.where assembles the result. Numba's ufunc fuses all operations into a single pass - one read per element, one write per element, no temporaries.
Section 6: @guvectorize - Generalised Ufuncs
@guvectorize extends @vectorize to operations that consume or produce arrays of a fixed shape per "element". It is the right tool for sliding windows, normalisation along an axis, and small matrix operations.
from numba import guvectorize
import numpy as np

# Layout string: (n),(n)->(n) means:
# first input is 1D length-n, second input is 1D length-n, output is 1D length-n
@guvectorize(
    ['void(float64[:], float64[:], float64[:])'],
    '(n),(n)->(n)',
    target='parallel',
    cache=True,
)
def normalise_to_reference(signal, reference, out):
    """
    Normalise each channel of signal relative to the corresponding reference.
    Operates on 1D slices; guvectorize broadcasts over higher dimensions.
    """
    n = signal.shape[0]
    ref_sum = 0.0
    for i in range(n):
        ref_sum += reference[i]
    if ref_sum == 0.0:
        for i in range(n):
            out[i] = 0.0
    else:
        for i in range(n):
            out[i] = signal[i] / ref_sum

# Sliding window operation
@guvectorize(
    ['void(float64[:], float64[:])'],
    '(n)->()',  # n-element input → scalar output
    target='parallel',
    cache=True,
)
def window_max(window, out):
    """Maximum value over a sliding window."""
    m = window[0]
    for i in range(1, window.shape[0]):
        if window[i] > m:
            m = window[i]
    out[0] = m
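A '(n)->()' gufunc reduces the last axis of whatever it is given, so to get an actual sliding window the caller has to supply the windows. One way to do that, sketched here as an assumption rather than as the author's method, is NumPy's sliding_window_view (NumPy 1.20+), which builds overlapping windows as a strided view without copying. The fallback branch is a NumPy stand-in with the same contract so the snippet runs without Numba.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

try:
    from numba import guvectorize

    @guvectorize(['void(float64[:], float64[:])'], '(n)->()')
    def window_max(window, out):
        m = window[0]
        for i in range(1, window.shape[0]):
            if window[i] > m:
                m = window[i]
        out[0] = m
except ImportError:  # NumPy stand-in with the same '(n)->()' contract
    def window_max(windows):
        return np.max(windows, axis=-1)

signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
windows = sliding_window_view(signal, 3)  # shape (3, 3): [1,3,2], [3,2,5], [2,5,4]
print(window_max(windows).tolist())       # [3.0, 5.0, 5.0]
```

The same call works unchanged on a 2D batch of signals: sliding_window_view(batch, 3, axis=-1) yields shape (rows, n_windows, 3), and the gufunc broadcasts over the first two axes.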
Section 7: Numba for CUDA - GPU Kernels (Optional)
Numba's CUDA support lets you write GPU kernels in Python. This section requires a CUDA-capable NVIDIA GPU and the CUDA toolkit installed.
# Check if CUDA is available
python -c "from numba import cuda; print(cuda.gpus)"
The CUDA Programming Model
CPU (Host)                GPU (Device)
──────────                ────────────
Python code runs here     Parallel kernel runs here

                          Grid
                          ┌──────────────────────┐
                          │ Block (0,0)          │
                          │ ┌───┬───┬───┬───┐    │
                          │ │T0 │T1 │T2 │T3 │    │ ← threads execute in parallel
                          │ └───┴───┴───┴───┘    │
                          ├──────────────────────┤
                          │ Block (1,0)          │
                          │ ┌───┬───┬───┬───┐    │
                          │ │T0 │T1 │T2 │T3 │    │
                          │ └───┴───┴───┴───┘    │
                          └──────────────────────┘
Each CUDA kernel invocation launches blocks × threads_per_block parallel threads. Each thread knows its position via cuda.threadIdx and cuda.blockIdx.
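The block count is rounded up, so a launch usually creates a few more threads than there are elements; that surplus is why kernels need a bounds check. A small worked example in plain Python (launch_config_1d is a hypothetical helper, not a Numba API):

```python
import math

def launch_config_1d(n_elements, threads_per_block=256):
    """Return (blocks_per_grid, total_threads, idle_threads) for a 1D launch."""
    blocks_per_grid = math.ceil(n_elements / threads_per_block)
    total_threads = blocks_per_grid * threads_per_block
    return blocks_per_grid, total_threads, total_threads - n_elements

print(launch_config_1d(10_000_000))  # (39063, 10000128, 128)
```

For 10 million elements and 256 threads per block, 39,063 blocks are launched and the last 128 threads have nothing to do; without a guard they would read and write past the end of the arrays.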
GPU Vector Addition
from numba import cuda
import numpy as np
import math

@cuda.jit
def vector_add_gpu(a, b, c):
    """
    GPU kernel: each thread adds one element.
    Thread index determines which element this thread processes.
    """
    # Compute this thread's global index
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if thread_id < a.shape[0]:  # bounds check - total threads may exceed array size
        c[thread_id] = a[thread_id] + b[thread_id]

def add_on_gpu(a_np: np.ndarray, b_np: np.ndarray) -> np.ndarray:
    """Transfer arrays to GPU, run kernel, transfer result back."""
    n = a_np.shape[0]
    # Transfer to GPU memory
    a_gpu = cuda.to_device(a_np)
    b_gpu = cuda.to_device(b_np)
    c_gpu = cuda.device_array(n, dtype=np.float64)
    # Configure launch: 256 threads per block, enough blocks for all elements
    threads_per_block = 256
    blocks_per_grid = math.ceil(n / threads_per_block)
    # Launch kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu, c_gpu)
    # Transfer result back to CPU
    return c_gpu.copy_to_host()

# Test
n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)
result_gpu = add_on_gpu(a, b)
result_cpu = a + b
assert np.allclose(result_gpu, result_cpu)
print("GPU result matches CPU result.")
GPU Matrix Multiplication
from numba import cuda, float64
import numpy as np

BLOCK_SIZE = 16  # 16×16 = 256 threads per block

@cuda.jit
def matmul_gpu(A, B, C):
    """
    GPU matrix multiplication using shared memory tiling.
    Each thread block computes a BLOCK_SIZE × BLOCK_SIZE tile of C.
    Assumes square n × n matrices.
    """
    # Shared memory tiles - allocated per block
    tile_A = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=float64)
    tile_B = cuda.shared.array(shape=(BLOCK_SIZE, BLOCK_SIZE), dtype=float64)
    row = cuda.blockIdx.y * BLOCK_SIZE + cuda.threadIdx.y
    col = cuda.blockIdx.x * BLOCK_SIZE + cuda.threadIdx.x
    n = A.shape[0]
    tmp = 0.0
    n_tiles = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # integer ceiling division
    for tile_idx in range(n_tiles):
        # Load tiles into shared memory
        tr = cuda.threadIdx.y
        tc = cuda.threadIdx.x
        if row < n and tile_idx * BLOCK_SIZE + tc < n:
            tile_A[tr, tc] = A[row, tile_idx * BLOCK_SIZE + tc]
        else:
            tile_A[tr, tc] = 0.0
        if tile_idx * BLOCK_SIZE + tr < n and col < n:
            tile_B[tr, tc] = B[tile_idx * BLOCK_SIZE + tr, col]
        else:
            tile_B[tr, tc] = 0.0
        # Wait for all threads in the block to finish loading
        cuda.syncthreads()
        # Compute the dot product for this tile
        for k in range(BLOCK_SIZE):
            tmp += tile_A[tr, k] * tile_B[k, tc]
        # Wait before loading next tile
        cuda.syncthreads()
    if row < n and col < n:
        C[row, col] = tmp
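The tiling scheme is easier to check on the CPU first. The sketch below mirrors the kernel's loop structure in NumPy, with array slices standing in for shared-memory tiles; matmul_tiled_cpu is a hypothetical helper written for illustration, and like the kernel it assumes square matrices.

```python
import numpy as np

def matmul_tiled_cpu(A, B, block=16):
    """Square-matrix product accumulated tile by tile, mirroring the kernel's loop."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float64)
    for row0 in range(0, n, block):        # one "thread block" per output tile
        for col0 in range(0, n, block):
            acc = np.zeros((min(block, n - row0), min(block, n - col0)))
            for k0 in range(0, n, block):  # corresponds to the tile_idx loop
                tile_A = A[row0:row0 + block, k0:k0 + block]
                tile_B = B[k0:k0 + block, col0:col0 + block]
                acc += tile_A @ tile_B     # partial dot products for this tile
            C[row0:row0 + block, col0:col0 + block] = acc
    return C

rng = np.random.default_rng(0)
A = rng.random((40, 40))  # 40 is deliberately not a multiple of 16
B = rng.random((40, 40))
assert np.allclose(matmul_tiled_cpu(A, B), A @ B)
```

NumPy slicing truncates at the array edge, which plays the role of the zero-filling branches in the kernel. On a GPU, the kernel itself would typically be launched on a 2D grid, for example matmul_gpu[(blocks, blocks), (BLOCK_SIZE, BLOCK_SIZE)](A_gpu, B_gpu, C_gpu) with blocks = ceil(n / BLOCK_SIZE), where A_gpu, B_gpu and C_gpu are device arrays created as in the vector-addition example.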
When CUDA Helps (and When It Does Not)
| Workload | GPU Benefit | Notes |
|---|---|---|
| Large matrix operations (N > 1000) | 10–100x | Memory bandwidth becomes limiting |
| Batch inference (deep learning) | 10–100x | Use CUDA via PyTorch/TensorFlow, not @cuda.jit |
| Monte Carlo simulation (millions of trials) | 10–50x | Embarrassingly parallel |
| Image processing (large batches) | 10–50x | Pixel operations are embarrassingly parallel |
| Small arrays (N < 100) | Slower | GPU launch overhead dominates |
| Sequential algorithms (depends on prev output) | Little benefit | Not parallelisable |
| I/O-bound work | No benefit | CPU waits on disk/network, not computation |
Section 8: Numba vs Cython vs NumPy - Decision Table
Choose based on your situation, not on what you've used before:
| Criterion | NumPy | Numba @njit | Cython |
|---|---|---|---|
| Build system required | No | No | Yes (C compiler) |
| Zero-overhead at import time | Yes | No (first-call JIT) | Yes (pre-compiled) |
| Arbitrary loop logic | Hard (vectorise it) | Yes | Yes |
| Non-numerical Python objects | No | Limited | Yes (with overhead) |
| GIL release | Yes (some ops) | Yes (nogil=True) | Yes (with nogil:) |
| GPU support | No (use CuPy) | Yes (@cuda.jit) | No |
| Parallel CPU | Via BLAS | Yes (prange) | Yes (prange+OpenMP) |
| C library integration | No | No | Yes (cdef extern) |
| Debugging ease | High | Medium | Low (C errors) |
| Works with PyPy | Yes | No | No |
| Supports nopython complex workflows | N/A | With typed Dict/List | Yes |
Decision Rule
Is the bottleneck a tight numerical loop over arrays?
    YES → Try Numba @njit first (zero build complexity)
          If Numba cannot handle the types → use Cython memoryviews
    NO  → Is the bottleneck expressible as array operations?
          YES → NumPy vectorisation
          NO  → Profile more carefully - may not be CPU-bound
Interview Questions
Q1: What is the difference between @jit and @njit in Numba? Why should you almost always prefer @njit?
In Numba releases before 0.59, a bare @jit means @jit(nopython=False). When Numba encounters types it cannot compile natively - Python lists, strings, arbitrary objects - it silently falls back to object mode: a compiled but slow form that still uses the Python interpreter for unsupported operations. The function compiles (pays compilation overhead) but runs slower than pure Python because of redundant dispatch. From 0.59 onward, a bare @jit defaults to nopython=True and the silent fallback is gone.
@njit is @jit(nopython=True). It raises a TypingError immediately at first call if any variable type cannot be resolved to a native C type. There is no silent fallback.
You should almost always prefer @njit because:
- Silent object mode fallback is a performance footgun - you pay compile overhead but get no speedup
- The error message from TypingError tells you exactly which type caused the problem
- It enforces discipline around Numba-compatible types upfront
The only reason to use @jit (without nopython=True) is during prototyping when you want partial compilation while you gradually make types compatible.
Q2: Explain Numba's type specialisation. What happens when you call a @njit function with different argument types?
Numba compiles a separate native code version for each unique combination of argument types. The first time f(arr_f64) is called, Numba compiles a version specialised for float64 arrays. The first time f(arr_f32) is called, Numba compiles another version specialised for float32 arrays. These specialisations are stored in a dispatch table.
Subsequent calls with the same types execute the already-compiled native code directly - no Python overhead at all, just a type lookup in the dispatch table (O(1)) and a native function call.
This means:
- First call per type signature pays the compilation cost (often hundreds of milliseconds)
- Subsequent calls are as fast as optimised C code
- Many different type signatures = many separate compilations = longer warmup time
With cache=True, compiled specialisations are saved to __pycache__ and loaded on subsequent interpreter starts, which removes the compilation cost after the first run (a small cache-load cost remains).
Q3: What are the safety requirements for using prange in Numba? What goes wrong if you violate them?
prange distributes loop iterations across CPU threads. The safety requirements are:
- Independence: result[i] must not depend on result[i-1] (or any value computed in another iteration). If iteration i=5 reads result[4] while the thread computing result[4] has not finished yet, you get a data race: undefined behaviour, silently wrong results.
- No conflicting writes: Two iterations must not write to the same memory location. If they do, whichever thread writes last wins - the other write is lost.
- Reductions are safe but must be on a single scalar variable: total += arr[i] inside a prange loop is automatically detected as a reduction. Numba creates thread-private copies of total, accumulates locally, and merges at the end. This is correct.
- No Python objects: parallel=True requires nopython mode, so the loop body cannot create Python lists, append to Python dicts, or call uncompiled Python functions; such code fails to compile. Numba's typed List and Dict do compile, but they are not thread-safe, so mutating a shared one from several iterations produces race conditions.
Violations produce silently wrong results - not exceptions. This is the most dangerous aspect of prange. Test parallel code against a serial reference implementation with np.allclose() before trusting it.
Q4: What is the difference between @vectorize and @guvectorize? Give a use case for each.
@vectorize creates a ufunc from a function that takes scalar inputs and returns a scalar output. The scalar function is applied element-wise to arrays, with automatic broadcasting. Example use cases: activation functions (ReLU, SELU), element-wise loss functions (Huber loss), custom clipping functions.
@guvectorize creates a generalised ufunc from a function that takes arrays of specified shapes as inputs and outputs. The layout string specifies the shape contract: '(n)->()' means "take a 1D array of length n, produce a scalar". Example use cases: sliding window aggregation (max, mean over a window), normalisation of each row/column of a matrix, dot product (single row-column pair → scalar).
The key difference: @vectorize processes one scalar per call; @guvectorize processes one array slice per call. For a sliding window maximum over a 2D input (rows, window_size), @vectorize cannot express this naturally (it only handles scalars), but @guvectorize with layout '(n)->()' applies the scalar-producing function to each row independently and broadcasts over the batch dimension.
Q5: You have a function decorated with @njit that processes financial tick data. In production, the first request each morning takes 3 seconds to respond because Numba is recompiling. How do you fix this?
There are three complementary approaches:
1. cache=True - the simplest fix. @njit(cache=True) saves the compiled machine code to __pycache__/. On the next interpreter start, the cache is loaded (fast) instead of recompiling. The first call is still slightly slower (cache lookup and load) but not seconds slower. This eliminates the 3-second compile on every restart, provided the cache directory is writable and survives restarts; in containers, point the NUMBA_CACHE_DIR environment variable at a persistent location.
2. Warmup at startup - call the function with representative dummy data during application startup (before accepting requests), so the compilation happens during the startup phase rather than on the first real request:
# In app startup (before accepting traffic)
import numpy as np
_dummy = np.zeros((1, 10), dtype=np.float64)
process_tick_data(_dummy) # triggers compilation
3. Ahead-of-time compilation with numba.pycc - Numba supports pre-compiling modules to .so files that can be imported like any C extension, eliminating all JIT overhead. Note that numba.pycc has been marked as pending deprecation since Numba 0.57, so check its status in the release you deploy before relying on it:
# aot_module.py
from numba.pycc import CC

cc = CC('compiled_ticks')

@cc.export('process_tick_data', 'f8[:](f8[:, :])')
def process_tick_data(data):
    ...

if __name__ == '__main__':
    cc.compile()
In practice, cache=True plus startup warmup resolves most production JIT latency issues. AOT compilation is reserved for environments where deployment constraints prevent JIT (e.g., strict container sandboxes).
